Chapter 15 · Full-Text Search

Full-Text Search & Elasticsearch

From a simple ILIKE query that suffocates under millions of rows, to relevance-ranked millisecond search — this chapter explains why tools like Elasticsearch exist, how the inverted index works, what BM25 does, and when to reach for Postgres FTS versus Elasticsearch in a real backend.

The Origin Story — Why Search Got Hard

Imagine it's 2005. You're a backend engineer at a fast-growing e-commerce startup. The product catalog has ~5,000 items. You write a search endpoint using a classic SQL pattern:

SQL-- Works great at 5 000 rows
SELECT * FROM products
WHERE  name        ILIKE '%laptop%'
   OR  description ILIKE '%laptop%';

This returns results in roughly 50 ms. Users are happy, managers are happy, life is simple.

Fast-forward two years. The catalog explodes to millions of products. The same query now takes 30 seconds. The company is losing sales. On top of the speed issue, new requirements arrive:

Relevance — searching "laptop" should show a MacBook Pro before a laptop bag.
Typo tolerance — users type lapto during a sale rush; results must still appear.
Speed — search must return in milliseconds, not seconds.

These three requirements — speed, relevance, typo-tolerance — are exactly the founding charter of dedicated full-text search engines.

The Librarian Analogy

Think of your relational database (Postgres, MySQL, etc.) as a librarian in a giant library. The librarian knows exactly where every book lives on the shelf.

But it has one fatal flaw: to find books about a topic, it must physically walk to every single shelf, pull out every book, and read it page by page looking for your keyword.

Fig 1 — Postgres ILIKE performs a full sequential scan: every row is examined individually.

The two problems summarised:

Speed: A table with 50 million rows means 50 million row comparisons. On spinning disk this was catastrophic even in the SSD era it wastes enormous I/O.
No relevance: A book whose title is "Introduction to Machine Learning" ranks identically to a book that merely mentions "machine" once in the appendix. The database has zero concept of importance.

Why `ILIKE '%term%'` Cannot Use Indexes

Postgres B-tree indexes work by sorting values. A B-tree can locate name LIKE 'lapt%' (prefix match) efficiently because sorted strings share prefixes. But ILIKE '%laptop%' has a leading wildcard — there is no useful prefix to sort on. The engine must fall back to a sequential scan.

A leading % in a LIKE/ILIKE expression disables B-tree index usage. Postgres will read every single page of the heap.

What the query plan looks like

SQL — EXPLAIN ANALYZEEXPLAIN ANALYZE
SELECT id, name
FROM  products
WHERE name ILIKE '%laptop%';

-- Output (simplified)
-- Seq Scan on products  (cost=0.00..18450.00 rows=5 width=36)
--   Filter: ((name)::text ~~* '%laptop%'::text)
--   Rows Removed by Filter: 4 999 995
-- Planning Time: 0.2 ms
-- Execution Time: 28 940.7 ms   ← 29 seconds!

Notice Seq Scan — every row examined, 4,999,995 rows thrown away, execution time ~29 s.

The Inverted Index — The Core Invention

"Instead of going through every document to find the term, maintain a map from every term to the documents that contain it."

This insight — inverting the search — is what the name captures. A normal index maps document → words. An inverted index maps word → documents.

How It Is Built

When a document is first stored (or updated), the search engine runs it through an analysis pipeline:

Tokenisation — break text into individual tokens (words). "Introduction to Machine Learning" → ["introduction", "to", "machine", "learning"]
Normalisation / Lowercasing — convert to lowercase so "Machine" and "machine" are the same token.
Stop-word removal — drop common words with no discriminative value ("to", "the", "a"). (Optional, configurable.)
Stemming / Lemmatisation — reduce words to their root: "running" → "run", "searches" → "search". So a query for "searching" also matches "searched".
Index entry creation — for each resulting term, record the document ID and position.

Fig 2 — Inverted index: term → posting list (document IDs + positions). Lookup is O(1) hash or O(log n) B-tree.

The key insight: inverted index lookup time depends on the number of matching documents, not the total number of documents. Millions of documents = no problem, as long as the hit list is small.

Elasticsearch — What It Is & How It Works

Elasticsearch is a distributed, JSON-document search engine built on top of Apache Lucene — the battle-tested Java library that implements the inverted index and BM25 scoring. Lucene has been around since 1999; Elasticsearch (2010) wraps it in a REST API and adds distributed clustering, replication, and a rich query DSL.

Key characteristics:

Schema-less JSON documents — like MongoDB, you store JSON. Fields are auto-detected (or explicitly mapped).
Near real-time — newly indexed documents are searchable within ~1 second (configurable).
Horizontal scaling — data is split into shards (sub-indexes), each shard is a complete Lucene index. More nodes = more throughput.
Powerful query DSL — JSON-based. Supports full-text, fuzzy, range, geo-distance, aggregations, etc.

Creating an Index with Field Mapping

HTTP / Elasticsearch DSLPUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",          // analysed — tokenised, stemmed
        "boost": 3            // title matches are 3× more relevant
      },
      "description": {
        "type": "text",
        "boost": 1.5
      },
      "category": {
        "type": "keyword"     // not analysed — exact match only
      },
      "price": { "type": "float" }
    }
  }
}

The difference between text and keyword is critical:

Mapping Type	Analysed?	Use For
`text`	Yes — tokenised, stemmed	Product names, descriptions, reviews
`keyword`	No — stored as-is	Status codes, tags, category IDs

A Full-Text Search Query

HTTP / Elasticsearch DSLGET /products/_search
{
  "query": {
    "multi_match": {
      "query": "laptop",
      "fields": ["name^3", "description^1.5", "category"],
      "fuzziness": "AUTO"    // typo tolerance
    }
  },
  "size": 10
}

^3 is field boosting — matches in name contribute 3× as much to the relevance score as matches in the default field weight.

BM25 — How Relevance Scoring Works

BM25 (Best Match 25) is the default ranking function in Elasticsearch (and also in Postgres FTS). It produces a floating-point score per document; higher = more relevant. The full formula has several components — but conceptually four factors drive the score:

Fig 3 — The four pillars of BM25 relevance scoring in Elasticsearch.

Factor	What It Measures	Effect on Score
Term Frequency	Count of query term in document	Higher = more relevant (with diminishing returns)
Inverse Doc Frequency	Rarity of term across index	Rare terms carry more signal than common ones
Document Length	Length vs. avg document length	Short docs with the term rank higher than long docs
Field Boost	Which field the term appears in	Title match > body match (configurable multiplier)

You don't need to understand the BM25 math to use Elasticsearch productively. Just know: field-boosted + term-frequency + IDF = the score. Tune boosts in mapping and query DSL.

Why IDF Matters

Consider the word "the" — it appears in virtually every English document, so it has very high document frequency and a very low IDF weight. It contributes almost nothing to scores. A rare word like "photovoltaic" or "elasticsearch" appearing in a document is a strong signal of topical relevance.

Typo Tolerance — Fuzzy Search

Elasticsearch uses Levenshtein edit distance under the hood for fuzzy matching. Edit distance is the minimum number of single-character operations (insert, delete, substitute) needed to transform one word into another.

lapto → laptop: 1 insertion → edit distance 1 ✓
treading → trending: 1 substitution → edit distance 1 ✓
mcbook → macbook: 1 insertion → edit distance 1 ✓

Elasticsearch DSL — fuzzy query{
  "query": {
    "fuzzy": {
      "name": {
        "value": "lapto",
        "fuzziness": "AUTO",  // 0 for 1-2 chars, 1 for 3-5, 2 for 6+
        "prefix_length": 1       // first N chars must match exactly
      }
    }
  }
}

"fuzziness": "AUTO" is the recommended setting. It automatically adjusts the allowed edit distance based on word length. prefix_length: 1 prevents matching completely unrelated words that happen to be 1–2 edits away.

Type-ahead / autocomplete interfaces like those on Amazon or Google use this capability. As you type "trendin", Elasticsearch finds "trending" with fuzziness=1 and returns suggestions in <50 ms.

The ELK Stack — Logs & Observability

Elasticsearch isn't only for product search. It's the "E" in the famous ELK Stack used for centralised log management and observability.

Fig 4 — The ELK Stack. Logstash ingests & parses, Elasticsearch indexes, Kibana visualises.

If your company already runs ELK for logging, adding product/content search on top of the same Elasticsearch cluster is a natural, cost-effective choice — you're already paying for the infrastructure and your ops team already knows it.

Postgres Full-Text Search in Go

Postgres has had native full-text search since version 8.3 via the tsvector / tsquery types. It uses its own inverted index (the GIN index) and a BM25-variant scorer called ts_rank. It's not as feature-rich as Elasticsearch but perfectly adequate for moderate search needs on data already in Postgres.

DDL — Add a GIN Index

SQL-- Add a generated tsvector column and index it
ALTER TABLE products
  ADD COLUMN search_vec tsvector
    GENERATED ALWAYS AS (
      setweight(to_tsvector('english', coalesce(name, '')), 'A') ||
      setweight(to_tsvector('english', coalesce(description, '')), 'B')
    ) STORED;

CREATE INDEX idx_products_fts ON products USING GIN(search_vec);

setweight('A') means title matches outrank description matches — the Postgres equivalent of field boosting.

Go — Full-Text Search Handler

Gopackage main

import (
    "context"
    "fmt"
    "log"
    "net/http"

    "github.com/jackc/pgx/v5/pgxpool"
)

type Product struct {
    ID          int
    Name        string
    Description string
    Rank        float64
}

func searchHandler(pool *pgxpool.Pool) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        query := r.URL.Query().Get("q")
        if query == "" {
            http.Error(w, "missing query param q", http.StatusBadRequest)
            return
        }

        // plainto_tsquery converts plain text safely (no special chars needed)
        // websearch_to_tsquery also supports AND/OR/-term syntax
        sql := `
            SELECT id, name, description,
                   ts_rank(search_vec, plainto_tsquery('english', $1)) AS rank
            FROM   products
            WHERE  search_vec @@ plainto_tsquery('english', $1)
            ORDER  BY rank DESC
            LIMIT  20
        `

        rows, err := pool.Query(context.Background(), sql, query)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        defer rows.Close()

        var results []Product
        for rows.Next() {
            var p Product
            rows.Scan(&p.ID, &p.Name, &p.Description, &p.Rank)
            results = append(results, p)
        }

        w.Header().Set("Content-Type", "application/json")
        fmt.Fprintf(w, "found %d results\n", len(results))
    }
}

func main() {
    pool, _ := pgxpool.New(context.Background(), "postgres://...")
    http.HandleFunc("/search", searchHandler(pool))
    log.Fatal(http.ListenAndServe(":8080", nil))
}

The @@ operator is Postgres's full-text match operator. It uses the GIN index — query time stays O(log n) regardless of table size.

Elasticsearch in Python (Official SDK)

The official Python client is elasticsearch-py. Install:

Shellpip install elasticsearch

Index Setup & Bulk Insert

Pythonfrom elasticsearch import Elasticsearch, helpers
import csv

es = Elasticsearch(
    cloud_id="YOUR_CLOUD_ID",
    api_key="YOUR_API_KEY"
)

INDEX = "reviews"

# 1. Create index with explicit mapping
if es.indices.exists(index=INDEX):
    es.indices.delete(index=INDEX)

es.indices.create(index=INDEX, body={
    "mappings": {
        "properties": {
            "review":    {"type": "text"},     # analysed full-text
            "sentiment": {"type": "keyword"}  # exact: "positive" / "negative"
        }
    }
})

# 2. Bulk insert from CSV
def generate_docs(path: str):
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row.get("review") and row.get("sentiment"):
                yield {
                    "_index": INDEX,
                    "_source": {
                        "review":    row["review"],
                        "sentiment": row["sentiment"]
                    }
                }

success, errors = helpers.bulk(es, generate_docs("reviews.csv"))
print(f"Inserted {success} documents, {len(errors)} errors")

Search with Fuzzy + Field Boost

Pythondef search_reviews(query: str, sentiment_filter: str = None) -> list:
    must_clauses = [{
        "multi_match": {
            "query":     query.lower(),
            "fields":   ["review"],
            "fuzziness": "AUTO",
            "operator":  "and"   # all terms must appear
        }
    }]

    filter_clauses = []
    if sentiment_filter:
        filter_clauses.append({
            "term": {"sentiment": sentiment_filter}
        })

    body = {
        "query": {"bool": {"must": must_clauses, "filter": filter_clauses}},
        "size": 20
    }

    resp = es.search(index=INDEX, body=body)
    return [
        {"score": hit["_score"], **hit["_source"]}
        for hit in resp["hits"]["hits"]
    ]

# Usage
results = search_reviews("gret product", sentiment_filter="positive")
# "gret" (typo for "great") matched via fuzziness=AUTO (edit distance 1)

Streaming Results from Two Sources (Node.js Pattern)

The demo in the lecture used a Next.js API route that streams results from Postgres and Elasticsearch simultaneously so neither waits for the other. Here is the pattern in pure Node.js:

Node.js (ESM)import { Client } from '@elastic/elasticsearch'
import { neon }     from '@neondatabase/serverless'

export async function GET(req) {
  const { searchParams } = new URL(req.url)
  const term = searchParams.get('q')

  const encoder = new TextEncoder()
  const stream = new ReadableStream({
    async start(controller) {
      // Fire BOTH queries concurrently — don't await sequentially
      const [pgResult, esResult] = await Promise.allSettled([
        pgSearch(term),
        esSearch(term),
      ])

      controller.enqueue(encoder.encode(JSON.stringify({
        source: 'postgres',
        ...pgResult
      })))
      controller.enqueue(encoder.encode(JSON.stringify({
        source: 'elasticsearch',
        ...esResult
      })))
      controller.close()
    }
  })
  return new Response(stream, { headers: { 'Content-Type': 'text/event-stream' } })
}

The Benchmark — ILIKE vs Elasticsearch on 50k Rows

The demo in the lecture populated 50,000 review documents into both Neon (cloud Postgres) and Elastic Cloud, both hosted in us-west to keep network latency equal. Results:

Fig 5 — Postgres ILIKE scales poorly with wider matches; Elasticsearch stays flat at ~500ms.

Query	Elasticsearch	Postgres ILIKE	Speedup
"laptop"	~500 ms	~3 s	6×
"only" (broad)	~500 ms	~7.5 s	15×

The broader the search term (more matching rows), the worse ILIKE gets — because Postgres must retrieve and return all those rows from disk. Elasticsearch computes relevance scores in-memory on the shard and only returns the top-N, so latency stays low.

When to Use What

Postgres Full-Text Search

Data already in Postgres
Search is a secondary feature, not the core product
Team small / no dedicated infra for ES
Moderate data volume (< a few million rows)
Don't need fuzzy/typo search on day one

Elasticsearch

Search is a first-class feature
Millions+ of documents
Need typo tolerance / autocomplete
Complex relevance tuning needed
Already running ELK for logs anyway
Need aggregations / analytics on search results

As the lecturer noted: "You don't have to master Elasticsearch the way you must master databases. Most search use cases are covered by copy-pasting from the official docs or an LLM." Know when to reach for the tool; the tool's docs will handle the rest.

Decision Flow

Fig 6 — Quick decision tree for choosing your search backend.

References & Further Reading

📖 Elasticsearch Official Docs 📖 Postgres Full-Text Search 📖 Elasticsearch Query DSL 📐 BM25 Similarity Config ⚙️ Apache Lucene 🔬 Analysis & Tokenisation 📦 ELK Stack Overview 🐹 pgx Go Driver Docs 🐍 Python ES Client

Backend Field Manual · Full-Text Search Chapter · Notes compiled from video lecture

Full-Text Search & Elasticsearch

The Origin Story — Why Search Got Hard

The Librarian Analogy

Why ILIKE '%term%' Cannot Use Indexes

What the query plan looks like

The Inverted Index — The Core Invention

How It Is Built

Elasticsearch — What It Is & How It Works

Creating an Index with Field Mapping

A Full-Text Search Query

BM25 — How Relevance Scoring Works

Why IDF Matters

Typo Tolerance — Fuzzy Search

The ELK Stack — Logs & Observability

Postgres Full-Text Search in Go

DDL — Add a GIN Index

Go — Full-Text Search Handler

Elasticsearch in Python (Official SDK)

Index Setup & Bulk Insert

Search with Fuzzy + Field Boost

Streaming Results from Two Sources (Node.js Pattern)

The Benchmark — ILIKE vs Elasticsearch on 50k Rows

When to Use What

Postgres Full-Text Search

Elasticsearch

Decision Flow

References & Further Reading

Why `ILIKE '%term%'` Cannot Use Indexes